Cancer diagnosis and classification have traditionally been based on the assessment of morphology by
microscopy. However, histological classification can be challenging, and demand for genetic
information is increasing in the era of targeted and personalized molecular therapy. Recently accumulated
comprehensive genomic data could be used to provide a molecular cancer classification alongside the
histological classification. This study identified a 19-gene signature able to classify endometrial cancers into
the two major histological subtypes, endometrioid and serous. In addition, when the genomic classifier was
applied to high-grade endometrioid adenocarcinoma (EM-HG), a subset (23.6%, 25/106) was predicted to
be similar to serous tumors at the molecular level. In analyses of multiple cancers, the classification model
may also be applicable to ovarian cancers.

Endometrial cancer is the most common gynecological malignancy in Western Europe and the USA, and is 
the seventh most common cancer in women worldwide 1,2 . The incidence of endometrial cancer has 
increased steadily in correlation with the current epidemic of obesity 3,4 . Endometrial cancers are classified 
by their histologies. A dualistic model classifying endometrial cancers into type I and type II is widely accepted to 
explain the pathogenesis 5,6 . 

Type I endometrial cancers comprise the majority of endometrial cancers. They occur on a background of 
unopposed estrogen overstimulation, have endometrioid histology resembling a normal endometrial gland, and 
are usually diagnosed as low-grade endometrioid adenocarcinoma (EM-LG). By contrast, type II endometrial 
cancers exhibit non-endometrioid histology, such as serous adenocarcinoma (EM-Serous), and are frequently
associated with TP53 mutations and an aggressive clinical course. Despite the differences in pathobiology and 
tumor behavior between the two types of endometrial cancers, differential diagnosis using the current histological 
classification system is frequently challenging 7,8 . In particular, this classification scheme does not always provide a 
clear distinction of tumor type in cases of high-grade endometrioid adenocarcinoma (EM-HG), referred to as 
FIGO (International Federation of Gynecology and Obstetrics) grade 3 endometrioid adenocarcinoma 6,8 . The 
distinction between type I and type II endometrial cancer is important because different treatments are
recommended for each type, and a different clinical course is observed.

Recently, comprehensive genomic data, including whole exome sequencing, RNA sequencing, and large-scale 
copy number alteration (CNA), have become available for various cancers. These genomic data could be used to 
provide a molecular cancer classification or identify molecular cancer markers, and, in this era of personalized 
cancer therapy, to provide practical methods to build bridges between genomics and clinical practice. 

In this study, we built classification models to predict the two major histologies of endometrial cancer,
type I (tumors with endometrioid histology) and type II (tumors with serous histology), using
whole exome sequencing data, RNA sequencing data, and global copy number data obtained from the TCGA
database (http://tcga-data.nci.nih.gov). These classification models were then compared to
identify the predictive model with the highest accuracy. The selected classification model was verified using
an independent external data set and was then applied to EM-HG, mixed-type endometrial cancers, and
multiple other cancers, including ovarian serous adenocarcinoma (Ov-Serous) and eight non-gynecological cancers.

Results 

Classification model building using expression data for endometrial carcinomas. To build a classification 
model that is able to discriminate between endometrioid histology and serous histology in endometrial cancer, 



SCIENTIFIC REPORTS | 4:5174 | DOI: 10.1038/srep05174




[Figure 1, panel (b): ROC curves with AUC = 0.988 (CCP), 0.988 (DLDA), and 0.983 (BCCP).]

Figure 1 | Classification modeling using expression and copy number data for histological subtypes of endometrial cancer. (a) All nine classifiers
showed high performance, with high sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and (b) AUC values. (c)
Ten-fold cross-validated probabilities for each class and misclassification error rates are shown for the PAM method. (d) Other methods, such as the
Bayesian compound covariate and (e) the Adaptive boosting method, also showed two distinct probability patterns for endometrioid and serous
histologies. (f) Gene lists selected by PAM, Adaptive boosting, and the remaining seven classifiers (classical classifiers) are shown. (g) Performance of the
modeling using actual copy number data.



low-grade (FIGO grades 1 and 2) endometrioid adenocarcinoma
(EM-LG, N = 184) and endometrial serous adenocarcinoma (EM-Serous,
N = 52) were defined as binary endpoints. The results of 10-fold
cross-validation (CV) using all nine classifiers showed high
performance (permutation p-value < 0.001), with high sensitivity,
specificity, positive predictive value (PPV), and negative predictive
value (NPV) irrespective of the classifier (Fig. 1a). Ten-fold cross-validated
receiver operating characteristic (ROC) curves showed
high area under the curve (AUC) values of up to 0.988 (Fig. 1b). Ten-fold
cross-validated probabilities for prediction of histologies
between EM-LG and EM-Serous were distinctively distributed into
two patterns (toward 100% for the serous type or 0% for the
Figure 2 | Classification modeling using mutation and binary copy number alteration data for the histological subtypes of endometrial cancer.
Distribution patterns by principal component (PC) and accuracy are shown for each model using mutation, copy number variation (CNV), or
mutation + CNV data, respectively.



endometrioid type) that were clearly correlated with the actual
histological types using classification methods such as Prediction
Analysis for Microarrays (PAM) (Fig. 1c), Bayesian compound
covariates (Fig. 1d), and Adaboost classifiers (Fig. 1e). In the gene
selection from the full data set, seven genes were selected by the classical
classifiers, eight genes by PAM, and 12 genes by Adaboost (Fig. 1f).
Of these, two genes, CLDN6 and KIAA1324,
overlapped in all classification methods (Fig. 1f), suggesting that they
may be the most important in the morphogenesis of endometrial cancer.
Detailed statistics are described for each gene in Supplementary
Tables 1-3.
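
As a rough illustration of how an AUC value like those reported for the cross-validated ROC curves can be obtained from predicted class probabilities, the sketch below uses the rank-based (Mann-Whitney) formulation. The function name `auc_from_scores` and the toy probabilities are hypothetical and are not taken from the study data; the original analysis was run in BRB ArrayTools and R, not with this code.

```python
def auc_from_scores(scores_pos, scores_neg):
    """AUC as the probability that a random positive (serous) sample
    scores higher than a random negative (endometrioid) sample.
    Ties between a positive and a negative score count as 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted probabilities of the serous class (illustrative only).
serous = [0.95, 0.90, 0.40]
endometrioid = [0.05, 0.10, 0.50, 0.02]
print(auc_from_scores(serous, endometrioid))
```

An AUC of 1.0 corresponds to perfectly separated probability patterns, while 0.5 corresponds to chance-level ranking.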

Classification model building using mutation and/or CNA data. 

We also built classification models using CNA data and/or mutation
data. For CNA data, two models were built with the two possible data
types: actual copy number data (continuous data) and GISTIC
results (binary data). When the classification model was built on
actual copy number data, sensitivity was lower than in the
expression data classification model across all classifiers
except PAM and Adaboost (Fig. 1g). For the GISTIC
or mutation classification models, reduced sensitivity was also
observed (Fig. 2). However, for models combining mutation and
GISTIC data, sensitivity increased to 0.83 (permutation p-value <
0.001) but remained lower than the sensitivity observed for the



expression data model. Overall, the distribution patterns of
EM-LG and EM-Serous were well separated, as shown by
principal components (PC) 1 and 2 (Fig. 2). The selected genes
for mutation and CNA data are summarized in Supplementary Table
4. Specifically, two mutated genes were selected: PTEN mutations for
endometrioid tumors (+beta value) and TP53 mutations for serous
tumors (-beta value). For the CNAs, ten genes exhibiting amplification,
ABHD16B, BCL7C, BRD4, CCNE1, DNM2, FOSL2, GALK1,
PGAP3, TERC, and ZMYND8, were selected, and all indicated the
serous type (-beta value). There was a significant correlation
between the models that used the expression data and the models
that used mutation + CNA data (p-value = 0.003; Supplementary
Table 5).
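
The binary mutation and CNA features entering these models were screened with Fisher's exact test (p < 0.001; see Methods). The following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of class versus alteration status. The function name `fisher_exact_two_sided` and the counts are hypothetical and chosen only for illustration; they are not the TCGA counts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = (serous, endometrioid),
# columns = (gene altered, gene not altered).
p = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p, 6))   # p-value for the toy table
print(p < 0.001)     # would this gene pass the screening threshold?
```

In this toy table the gene would not pass the p < 0.001 screen, which shows how stringent the threshold is for small sample sizes.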

External validation of the models using expression data. The
expression data classification model showed the best performance.
We focused on this classification model since producing expression
data is easier and cheaper than detecting mutations and CNAs. The
established classification model using expression data was applied to
an external independent microarray data set
containing 63 EM-LG and 12 EM-Serous samples. High sensitivity and
specificity were obtained for most classifiers, although two classifiers,
3-Nearest Neighbors and Support Vector Machine,
showed a high error rate for serous tumors (Fig. 3a). For each individual
Figure 3 | External validation of the expression data model. (a) High sensitivities and specificities were identified using an independent external
validation set. (b) The predicted type and original histology of each sample are shown. (c) Sensitivities and specificities and (d) subgrouping using
consensus clustering methods.



sample, detailed prediction results using the classification models are
shown according to the different classifiers (Fig. 3b). The results
predicted by the classification model correlated highly with the
original histology. For comparison with the
classifiers, we also applied hierarchical and non-negative matrix
factorization (NMF) consensus clustering using the 19 selected genes.
Consensus clustering analysis showed lower specificity than the
classifiers (Fig. 3c). In this analysis, the samples were divided into
two groups (k = 2). All samples with EM-Serous histology were
found in one group; however, this group also contained many
samples with EM-LG histology (Fig. 3d).

Application of the model to EM-HG and mixed endometrial
carcinoma. This classification model is able to distinguish endometrioid
from serous histology in endometrial cancers with high
performance and accuracy. We applied this classification model to
EM-HG and endometrial cancers that were originally classified as
mixed histology. When the model was applied to samples with mixed


histology (N = 10), three (30%) samples were classified as endometrioid
and the remaining seven (70%) were classified as serous
(Fig. 4a). When we reviewed the histology of the three available
tumors, one sample predicted as serous showed serous histology,
and the sample predicted as
endometrioid showed predominantly endometrioid histology by
hematoxylin and eosin (H&E) staining (Fig. 4a, H&E slides, upper
and mid). The other sample predicted as serous showed a mixed pattern of
serous and endometrioid histology (Fig. 4a, H&E slide, lower). The
classification model was also applied to EM-HG, as it can be
uncertain whether EM-HG belongs to the endometrioid or serous
type of tumor. Of the 106 samples with EM-HG histology, the
classifiers predicted that 25 (23.6%) samples were serous and the
remaining 81 (76.4%) were endometrioid (Fig. 4b). Although
the 106 samples were originally diagnosed as endometrioid type
with FIGO grade 3, a subset of samples was reclassified as serous,
suggesting that the reclassified samples are more similar to EM-Serous
than to endometrioid tumors at the molecular level. When the
Figure 4 | Application of the classification model to mixed endometrial carcinoma and high-grade endometrioid adenocarcinoma (EM-HG). (a) In the
mixed endometrial carcinoma, the predicted type and reviewed histology are well correlated. (b) In the EM-HG, a subset of samples were predicted
as serous tumors and (c) the model prediction tended to be correlated with clustering using a whole gene expression pattern.



clustering analysis was performed using all endometrial cancers,
including samples with EM-LG, EM-HG, and EM-Serous histology,
EM-HG samples within the EM-LG cluster tended to be
predicted as endometrioid, and EM-HG samples within the EM-Serous
cluster tended to be predicted as serous (Fig. 4c).

Application of the model to multiple cancers. Finally, we applied
the classification model to nine cancer types. The median values of
the probabilities were around 50%, although colorectal cancers
tended to be classified as endometrioid as opposed to serous. Most
ovarian serous adenocarcinomas were predicted to be serous with
high probabilities (Fig. 5a). When the clustering analysis was
performed for nine cancer types (N = 3766), some cancer types
derived from the same organ clustered separately,
irrespective of tumor origin; however, endometrial cancers and
ovarian cancers clustered together (Fig. 5b), suggesting that our
model may also be applicable to ovarian cancer.

Discussion 

In this study, we constructed a classification model using mutation
data, CNA data and expression data to distinguish between histological
subtypes of endometrial cancer. Recent high-throughput
sequencing technology has generated an enormous amount of data
for somatic mutations and CNAs in a range of cancer types. The
development of models that use somatic mutation data or CNA data
is an effective clinical use of genomic data. In our previous study, we
made a predictive model using a somatic mutation profile to predict
patient survival in ovarian cancer 9 . In this study, we showed that
classification models using mutation and/or discrete copy number
data are effective and applicable. These models, using binary data,
show high performance, although sensitivity is lower than that of the



expression data model. The lower sensitivity of the mutation or copy
number data models is probably due to the low frequency of mutations
in most genes and to the high frequency of CNAs in EM-Serous
but not in EM-LG tumors 10 . Among the mutations, only those in the
well-known genes PTEN and TP53 were included in the classification
model: PTEN mutations occur frequently in endometrioid
adenocarcinoma, and TP53 mutations frequently occur in serous
adenocarcinoma.

Our study suggests that expression patterns using up to 19 genes
are able to classify endometrial cancer into two subgroups: endometrioid
and serous. Of the 19 genes, six genes (L1CAM, CLDN6,
RSPO4, PLAG1, FCHO1, and DHCR24) were elevated in EM-Serous,
and the remaining 13 genes (PIGR, IHH, TFF3, SPDEF,
KIAA1324, C9ORF152, FOXA2, CNGA1, HHAT, UQCR11,
MDM2, SORBS2, and PPAP2C) were elevated in EM-LG. Among
them, CLDN6 and KIAA1324 were consistently selected in all nine
classifiers, suggesting that these genes may be the most important in the
morphogenesis of endometrial cancer. KIAA1324 is induced by
estrogen and is a good endometrial biomarker associated with a
hyperestrogenic state and estrogen-related type I endometrial
adenocarcinoma 11 . The two types of endometrial adenocarcinomas can be
distinguished by claudin 1 (CLDN1) and claudin 2 (CLDN2) 12 .
Therefore, the related gene CLDN6 may also be a useful marker to
discriminate between the two groups.

In this study, a high error rate was present in predicting the serous
type of cancer in a few classifiers. Possible reasons include the
following: 1) certain classifiers, such as Support Vector Machine, may
be inappropriate for binary classification schemes in this data set; 2)
unequal distribution of samples between the endometrioid and serous
types (fewer serous than endometrioid tumors); and 3) model
building relied on RNA sequencing data (TCGA data set), which
Figure 5 | Application of the model to multiple cancer types. (a) When the established model was applied to nine types of cancers, most ovarian serous
adenocarcinomas were predicted as being of serous histology with high probabilities. (b) When clustering was performed (N = 3766), some cancer
types of even the same organs failed to cluster, indicating heterogeneity; however, endometrial cancers and ovarian cancers clustered together.
have a larger dynamic range of expression than the independent
microarray data used for validation. Another possible reason is
potentially flawed histology, since serous tumors may be confused
with high-grade endometrioid carcinoma. In addition, different
molecular profiles may be present in the same tumor, with increased
heterogeneity in high-grade tumors such as serous carcinomas. In
this study, we used nine classifiers for classification modeling.
Among them, two classifiers, 3-Nearest Neighbors and Support
Vector Machine, showed a high error rate for serous cancers. The
remaining seven classifiers showed zero or very low error rates for
prediction of the serous type, which suggests that the cause of the
error rate lies with a few classifiers rather than with the histology or
the data. Therefore, the use of several classifiers may be important to
minimize misclassification in clinical application. In addition,
correlation between histologic and molecular classification is important
for reaching the correct diagnosis in clinical application.
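
One simple way to use several classifiers together is a majority vote over their per-sample predictions. The study does not specify a combination rule, so the sketch below is an assumption made purely for illustration; the function name `majority_vote` and the "undetermined" label for ties are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among classifier predictions,
    or 'undetermined' when the two most frequent labels tie."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "undetermined"
    return counts[0][0]


# Nine hypothetical classifier calls for one sample: two dissenting
# classifiers are outvoted by the remaining seven.
calls = ["serous"] * 7 + ["endometrioid"] * 2
print(majority_vote(calls))
```

Flagging ties instead of forcing a call leaves room for the histological review that the text recommends alongside molecular classification.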

When the model was applied to the ten mixed-type histologies,
three (30%) samples were classified as endometrioid and the remaining
seven (70%) were classified as serous. Although we did not review
the histology in all cases, one case predicted as endometrioid showed
predominantly endometrioid histology and two cases predicted as
serous contained serous areas, which suggests that the classification
model correlates with histology. One of the interesting
findings of this study is a discrepancy between histologic diagnosis and
classification by modeling in EM-HG, 23.6% of which were classified as
serous. This reclassification suggests that a subset of EM-HG may be more similar
to the EM-Serous type than to the endometrioid type at the molecular
level. This finding also suggests that EM-HGs may be more
molecularly and histologically heterogeneous than initially thought.

We performed clustering analysis for multiple cancers and found
that the cancers did not cluster by tumor origin. This suggests that
classification according to molecular features may be more informative for
treatment than classification by tumor origin alone. We also found that
endometrial cancers and ovarian cancers
were closely linked and clustered together. Therefore, our model
may be applicable to the classification of ovarian cancers having
histological subtypes of serous and endometrioid adenocarcinoma.

In summary, we created endometrial cancer classification models
using different platforms and validated the models. A model using a
19-gene signature was able to classify endometrial cancers into the
two major histological subtypes. Classification models using genomic
data may complement histology in establishing diagnoses, and
this study also suggests that using multiple classifiers could be
important to minimize misclassification in clinical application. In
the era of targeted molecular therapy, it is potentially important to
report molecular classification predictions alongside histological
classifications.

Methods 

Genomic data. For endometrial cancer modeling, the following genomic data were 
used. 

Expression data: mRNA expression data were derived from the RNASeqV2 RSEM data
for endometrial cancer (N = 370), breast cancer (N = 914), colorectal cancer (N =
243), glioblastoma (N = 153), lung adenocarcinoma (N = 230), lung squamous cell
carcinoma (N = 347), melanoma (N = 282), renal cell carcinoma (N = 480), Ov-Serous
(N = 261), and thyroid papillary carcinoma (N = 486). The data were
normalized using quantile normalization with log2 transformation.
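
The normalization step can be pictured with the following minimal sketch of a log2 transformation followed by quantile normalization. The pseudocount of 1, the order of the two operations, and the positional tie-breaking are assumptions, since the text does not specify them; the function name `quantile_normalize` and the toy values are hypothetical.

```python
from math import log2


def quantile_normalize(samples):
    """Quantile-normalize equally long samples (one list of gene values
    per sample) so that all samples share the same value distribution.
    Ties are broken by position, which is a simplification."""
    n_genes = len(samples[0])
    sorted_cols = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(samples)
                 for k in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy samples with three genes each (illustrative values only).
raw = [[0, 3, 15], [7, 1, 31]]
logged = [[log2(x + 1) for x in s] for s in raw]  # assumed pseudocount of 1
norm = quantile_normalize(logged)
print(norm)
```

After normalization every sample contains the same set of values, and only the assignment of those values to genes differs between samples.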

Mutation data: for modeling, all observed somatic mutations across all sequenced
cases (N = 232) and mutated genes, as determined by a merger of the MutSig v2.0
and MutSigCV v0.9 (Q-value < 0.1) test results 10 , were used. There are 29 genes in the
list of genes significantly mutated in endometrial cancer (Supplementary Table 6).

Copy number data: segmented copy number data generated using an Affymetrix
Single Nucleotide Polymorphism (SNP) 6.0 array (N = 492) were used. The downloaded
segmented copy number data were analyzed with GISTIC2.0 13,14 to identify
significant focal CNAs. The thresholds for significant focal CNAs were as follows:
amplification and deletion threshold, 0.1; cap-values, 1.5; broad length cut-off, 0.7;
confidence level, 0.95; joint segment size, 4; level peel-off, 1; and maximum sample
segments, 2000. Details of each of these parameters have been previously described 14 .
For modeling using CNA data, both actual copy number data and discrete copy
number results determined by GISTIC were used. For the discrete copy number
results with high-level amplifications or homozygous deletions, a one-tailed Wilcoxon
signed-rank test was used to filter the cases in which mRNA expression values were
significantly higher or lower in amplified or homozygously deleted samples versus
diploid samples.

Classification modeling and internal validation. The methods used to build the
classification models using continuous variables, such as expression data or actual
copy number data, were as follows: compound covariate predictors, diagonal linear
discriminant analysis, 1-Nearest Neighbor, 3-Nearest Neighbors, Nearest centroid,
Support Vector Machine (SVM), Bayesian compound covariates, class prediction
using PAM 15 , and the Adaptive boosting (Adaboost) method 16 . Genes with significant
differences between the two classes (t-test, p < 0.001) and genes in which the
fold-difference between the two classes exceeded 30 were used to fit the classification
model. For binary outcome data, such as mutation data and the discrete CNV data
from GISTIC, we used a classification method combining Fisher's exact test and
1-norm SVM 17 to predict the binary response using mutation and/or CNA data. We
selected the significant genes (Fisher exact test p < 0.001) before fitting the
classification model. A chi-square test was used for testing the association between the
original and predicted responses. To evaluate the predictive performance of the
classification models, a 10-fold CV procedure was used as follows:

Step 1. The total data were randomly divided into ten equally sized subsets. 

Step 2. A single subset was used as the validation data, and the remaining nine 
subsets were used as training data. 

Step 3. The significant genes (t-test, p < 0.001 for continuous data, or Fisher exact 
test, p < 0.001 for binary data) were selected from the training set. 

Step 4. The classification method was applied to the selected genes and a
classification model was fitted.

Step 5. A fitted classification model was applied to the validation data and the 
responses were predicted. 

Step 6. Steps 2-5 were repeated ten times, so that each subset served once as the validation data.

Step 7. The chi-square p-value was calculated using the original and predicted 
responses. 
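
The cross-validation steps above can be sketched as follows, with gene selection repeated inside every fold so that the held-out subset never influences which genes are chosen. Several simplifications are assumptions rather than the study's method: a nearest-centroid classifier stands in for the nine classifiers, a cutoff on the absolute Welch t-statistic (hypothetical `cutoff=4.0`) stands in for the t-test p < 0.001 filter, the final chi-square test of Step 7 is omitted, and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean, variance


def t_stat(x, y):
    """Welch t-statistic between two lists of values."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se if se > 0 else 0.0


def select_genes(X, y, cutoff):
    """Indices of genes whose |t| exceeds the cutoff (stand-in for p < 0.001)."""
    keep = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 1]
        b = [row[g] for row, lab in zip(X, y) if lab == 0]
        if abs(t_stat(a, b)) > cutoff:
            keep.append(g)
    return keep


def fit_centroids(X, y, genes):
    """Nearest-centroid model: per-class mean of each selected gene."""
    return {lab: [mean([row[g] for row, l in zip(X, y) if l == lab])
                  for g in genes] for lab in (0, 1)}


def predict(model, genes, row):
    """Assign the class whose centroid is closest in squared distance."""
    dist = {lab: sum((row[g] - c) ** 2 for g, c in zip(genes, cen))
            for lab, cen in model.items()}
    return min(dist, key=dist.get)


def cross_validate(X, y, k=10, cutoff=4.0, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)            # Step 1: random split
    folds = [idx[i::k] for i in range(k)]
    predicted = [None] * len(X)
    for fold in folds:                          # Steps 2 and 6: each subset held out once
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(Xtr, ytr, cutoff)  # Step 3: training data only
        model = fit_centroids(Xtr, ytr, genes)  # Step 4: fit the model
        for i in fold:                          # Step 5: predict held-out samples
            predicted[i] = predict(model, genes, X[i])
    return predicted


# Synthetic data: 40 samples, 5 genes, only gene 0 differs between classes.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(6.0 * lab if g == 0 else 0.0, 1.0) for g in range(5)]
     for lab in y]
pred = cross_validate(X, y)
accuracy = sum(p == t for p, t in zip(pred, y)) / len(y)
print(accuracy)
```

Selecting genes on the full data set before splitting would leak information from the validation subset into the model, which is why Step 3 sits inside the loop.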

To remove the overfitting bias of the 10-fold CV, we calculated a permutation p-value,
as in Simon et al. 18 and Pang and Jung 19 , as follows: 1) the naive chi-square p-value
($P_0$) was computed from the 10-fold CV procedure for the original data, 2) the
chi-square p-value ($P_b$) was computed from the 10-fold CV procedure for the b-th
permuted data ($b = 1, \ldots, B$), and 3) a permutation p-value was calculated using the
equation $p = B^{-1} \sum_{b=1}^{B} I(P_b < P_0)$, where $I(\cdot)$ is the indicator function.
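
As a small worked example of the last step, the permutation p-value is the fraction of the B permuted data sets whose cross-validated chi-square p-value falls below the p-value obtained on the original data. The sketch below assumes the strict inequality as printed; the function name `permutation_p_value` and the numbers are hypothetical.

```python
def permutation_p_value(p0, permuted_p_values):
    """Fraction of permuted-data CV p-values that are smaller than p0."""
    b = len(permuted_p_values)
    return sum(1 for pb in permuted_p_values if pb < p0) / b


# Hypothetical naive p-value and four permuted-data p-values.
p0 = 0.01
permuted = [0.5, 0.2, 0.005, 0.8]
print(permutation_p_value(p0, permuted))
```

When no permuted p-value falls below the original one, the estimate is 0, which in practice would presumably be reported as being below 1/B.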

Measurement of the accuracy of the predictive model. Cross-validated ROC curves 
and AUC values were used. The performance measurements used were sensitivity 
(the probability that the EM-Serous would be correctly predicted as EM-Serous), 
specificity (the probability that the EM-LG would be correctly predicted as EM-LG), 
PPV (the probability that a sample predicted as EM-Serous actually belongs to EM- 
Serous), and NPV (the probability that a sample predicted as EM-LG actually belongs 
to EM-LG). In addition, normalized gene expression data from 91 stage I endometrial
cancers derived from the Affymetrix Human Genome U133 Plus 2.0 array 20 were used for
external validation.
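
The four performance measures defined above follow directly from the counts of a 2x2 confusion table with EM-Serous as the positive class. The sketch below is illustrative only; the function name `performance` and the toy labels are hypothetical and do not reproduce the study results.

```python
def performance(actual, predicted, positive="EM-Serous"):
    """Sensitivity, specificity, PPV and NPV with `positive` as the
    positive class. A measure is NaN when its denominator is zero."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # serous predicted as serous
        "specificity": ratio(tn, tn + fp),  # EM-LG predicted as EM-LG
        "ppv": ratio(tp, tp + fp),          # predicted serous that is serous
        "npv": ratio(tn, tn + fn),          # predicted EM-LG that is EM-LG
    }


# Toy labels: 4 serous and 6 low-grade endometrioid samples.
actual = ["EM-Serous"] * 4 + ["EM-LG"] * 6
predicted = ["EM-Serous"] * 3 + ["EM-LG"] + ["EM-Serous"] + ["EM-LG"] * 5
result = performance(actual, predicted)
print(result)
```

With unequal class sizes, as in this data set where serous tumors are the minority, PPV and NPV depend on class prevalence while sensitivity and specificity do not.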

Microscopy imaging data. Three available histological images of mixed-type
endometrial cancer (TCGA-AX-A1CR, TCGA-BK-A0CA, and TCGA-D1-A0ZZ)
were obtained from Berkeley Cancer Morphometric Data
(http://tcga.lbl.gov/biosig/tcgadownload.do).

Consensus clustering for expression data. A consensus hierarchical and NMF
clustering with iterative feature selection was performed. Consensus clustering is a
resampling-based procedure that repeatedly samples a sample subset and then uses
clustering to find intrinsic groupings 21,22 . Consensus clustering records the
proportion of resamplings in which pairs of tumors were in the same clusters 21,22 .
NMF is an algorithm based on decomposition by parts that can reduce the dimension
of the data; it is also an efficient method for the identification of distinct molecular
patterns and provides a powerful method for class discovery 23 . These algorithms have
been previously described 23,24 .
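
The resampling idea behind consensus clustering can be sketched as follows. This is not the GenePattern implementation used in the study: a tiny one-dimensional 2-means routine is swapped in for hierarchical or NMF clustering, and the resampling fraction of 0.8, the number of resamplings, and all function names are assumptions made for illustration.

```python
import random


def two_means_1d(values, iters=20):
    """Tiny 2-means clustering on scalars; returns a 0/1 label per value."""
    c = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c[0]) <= abs(v - c[1]) else 1 for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                c[k] = sum(members) / len(members)
    return labels


def consensus_matrix(values, n_resamples=100, fraction=0.8, seed=0):
    """Proportion of resamplings in which each pair of samples, when both
    were drawn, fell into the same cluster."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        subset = rng.sample(range(n), max(2, int(fraction * n)))
        labels = two_means_1d([values[i] for i in subset])
        for a, i in enumerate(subset):
            for b, j in enumerate(subset):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Toy one-gene expression values forming two obvious groups.
values = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
consensus = consensus_matrix(values)
print(consensus[0][1], consensus[0][3])
```

Pairs with a consensus value near 1 are grouped together in almost every resampling, while values near 0 indicate pairs that are consistently separated.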

Statistical analysis and data mining. Modeling, data analysis, and data mining were
performed using the BRB array tool 25 and R-program (version 2.14.2; www.r-project.org).
Consensus clustering analysis was performed using GenePattern from the Broad
Institute with the "ConsensusClustering" and "NMFConsensus" modules and
pipelines 26 . Statistical analyses for association tests were performed using Stata/IC
statistical software (version 12; StataCorp, TX) or the R-program (version 2.14.2;
www.r-project.org).